import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import Legend
from bokeh.models import FactorRange, Legend
import folium
from branca.element import Template, MacroElement
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.size'] = 12
Yelp Reviews and Restaurants¶
Motivation¶
We are working with two datasets from Yelp. One containing a subset of businesses in USA and Canada and one containing reviews of these businesses. We chose this dataset as we thought it would be fun and interesting to analyse review trends both time-wise and geographically. Furthermore, the datasets are large enough and contain enough information to hopefully gain alot of fun insights. As the datasets contain only a subset of the cities on yelp we chose to work specifically on the Philidelphia part of the dataset as this was the largest city present.
Yelp reviews are important because they help people decide where to eat. A 2022 article from Propel [^1] explains that Yelp has a big influence on small businesses in Philadelphia, and that star ratings are one of the first things new customers notice.
Another article from Yelp for Business [^2] shows that 96% of people on Yelp compare different places before deciding where to go. This supports our goal of helping readers explore their options and make a good choice using Yelp data. Additionally, the article tells that many users take action quickly, often reaching out to a business the within a day of researching its reviews. This shows that reviews aren’t just opinions but they actually affect real decisions people make right away.
A Medium article by an economist [^3] also points out that Yelp reviews reflect more than just food quality. They can show things like the customer’s expectations, their mood, or even the time they visited.
In the end, we want our project to help readers see useful patterns in Yelp reviews and guide them toward the best places and at the best times in Philadelphia.
Basic Stats¶
In this section we cover the two datasets and out choices in data preparation and preprocessing as well as some basic stats of the data.
Data: Businesses¶
The full business dataset contains ~150.000 businesses and for each of these it has:
- Name
- Business ID
- Location
- Average rating from 1-5
- Number of reviews
- Categories (Hotel, Restaurant, etc.)
- Various attributes (Such as parking or payment options)
- Opening hours
Firstly we want to filter the business data to contain only those in Philadelphia:
# Load and filter the business dataset
df_business = pd.read_json('../data/yelp_academic_dataset_business.json', lines=True)
df_business = df_business[df_business['city'] == 'Philadelphia']
print(f"Number of businesses in Philadelphia: {len(df_business)}")
print("Sample data from the business dataset:")
# Display the first row of the filtered dataset
df_business.head(1)
Number of businesses in Philadelphia: 14569 Sample data from the business dataset:
| business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
We then want to find those that are labeled as restaurants:
restaurant_ids = set()
for _, b in df_business.iterrows():
if b['categories'] and 'Restaurants' in b['categories']:
restaurant_ids.add(b['business_id'])
print(f"Number of restaurants in Philadelphia: {len(restaurant_ids)}")
Number of restaurants in Philadelphia: 5852
df_businesses = df_business
bin_width = 0.5
min_star = df_businesses['stars'].min()
max_star = df_businesses['stars'].max()
bin_edges = np.arange(min_star - bin_width/2, max_star + bin_width, bin_width)
plt.figure(figsize=(10, 6))
sns.histplot(df_businesses['stars'], bins=bin_edges, color='red')
# Set xticks at each star rating level
plt.xticks(np.arange(1.0, 5.1, 0.5)) # Ticks from 1.0 to 5.0 by 0.5 steps
plt.xlabel('Star Rating')
plt.ylabel('Number of Buisinesses')
plt.title('Distribution of Star Ratings for Restaurants in Philadelphia')
plt.show()
It can in the histogram be seen that the distribution is quite skewed to the right with most restaurants having a rating above 3.
Data: Reviews¶
The full review dataset contains ~7.000.000 reviews and for each of these it has:
- Review ID
- User ID
- Business ID
- Rating
- Other users' opinion of the review
- Textual review
- Date and time of day
Firstly, we want to look solely on the reviews for restaurants which we have data on:
df_reviews = pd.read_json('../data/philadelphia_restaurant_reviews.json', lines=True)
# Filter reviews to include only those for restaurants in Philadelphia
df_reviews = df_reviews[df_reviews['business_id'].isin(restaurant_ids)]
print(f"Number of reviews for restaurants in Philadelphia: {len(df_reviews)}")
print("Sample data from the reviews dataset:")
# Display the first row of the filtered dataset
df_reviews.head(1)
Number of reviews for restaurants in Philadelphia: 687289 Sample data from the reviews dataset:
| review_id | user_id | business_id | stars | useful | funny | cool | text | date | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
We see that we have just under 700.000 reviews of restaurants in Philadelphia. When looking at the distribution of reviews we again see a clear trend that most reviews are either 4 or 5 stars. This correlates well with the distribution of restaurant ratings.
df = df_reviews
df_businesses = df_businesses[df_businesses['city'] == 'Philadelphia']
# Filter the businesses to only include restaurants
df_businesses = df_businesses[df_businesses['categories'].str.contains('Restaurants', na=False)]
# Histogram of the number of reviews per star rating
plt.figure(figsize=(10, 6))
sns.histplot(df['stars'], bins=5, color='blue', label='Star Rating', discrete=True)
plt.xticks([1, 2, 3, 4, 5]) # Ensure ticks are at the center of each bar
plt.xlabel('Star Rating')
plt.ylabel('Number of Reviews')
plt.title('Distribution of Star Ratings of Restaurant Reviews in Philadelphia')
plt.show()
Data Analysis¶
Distribution of Ratings¶
Initially we want to look at the distribution of ratings for the restaurants in Philadelphia. We wish to see if reviewers generally give high or low ratings. From the histograms in Basic Stats we already see that there appears to be a trend that most restaurants have a rating above 3.0. to examine this further we will showcase the distribution of ratings for all restaurants in Philadelphia using a boxplot.
# Create boxplot of star ratings for restaurants
plt.figure(figsize=(10, 6))
sns.boxplot(x='stars', data=df_businesses, color='red')
plt.xticks(np.arange(1.0, 5.1, 0.5)) # Ticks from 1.0 to 5.0 by 0.5 steps
plt.xlabel('Star Rating')
plt.ylabel('Number of Buisinesses')
plt.title('Boxplot of Star Ratings for Restaurants in Philadelphia')
plt.show()
By looking at a boxplot of the stars of the restaurants we can see that 75% of the restaurants have a rating above 3.0 and that the median is 4.0. This does seem to indicate that people are generally happy with the restaurants in Philadelphia.
# Create boxplot of star ratings for restaurant reviews
plt.figure(figsize=(10, 6))
sns.boxplot(x='stars', data=df, color='blue')
plt.xticks([1, 2, 3, 4, 5]) # Ensure ticks are at the center of each bar
plt.xlabel('Star Rating')
plt.ylabel('Number of Reviews')
plt.title('Boxplot of Star Ratings of Restaurant Reviews in Philadelphia')
plt.show()
For reviews we again see a very similar boxplot to the one we saw for the restaurant ratings. Again we have that 75% of the reviews are above 3.0 and the median is 4.0. But now, as seen by the 3rd quartile, there are even more ratings of 5 stars.
Impact of Categories on Rating¶
The dataset contains numerous restaurants of many different categories that we wish to explore further. We will do this by looking at the distribution of ratings for each category. The categories analyzed will be the 23 most common ones in the buisness dataset as well as 'cheesesteaks' as this is a very popular dish in Philadelphia. It should be noted that 'Restaurant' and 'Food' cartegories are not included as they are too broad and would not provide any useful information.
# split the categories into separate columns
df_businesses['categories'] = df_businesses['categories'].str.split(',')
df_businesses = df_businesses.explode('categories')
df_businesses['categories'] = df_businesses['categories'].str.strip()
# remove categories that appears less than 200 times
category_counts = df_businesses['categories'].value_counts()
categories_to_remove = category_counts[(category_counts < 200)].index
# remove cheesesteaks from the categories to remove
categories_to_remove = categories_to_remove[~categories_to_remove.str.contains('Cheesesteaks', na=False)]
df_businesses = df_businesses[~df_businesses['categories'].isin(categories_to_remove)]
# remove 'restaurants' and 'food' from the categories
df_businesses = df_businesses[~df_businesses['categories'].str.contains('Restaurants', na=False)]
df_businesses = df_businesses[~df_businesses['categories'].str.contains('Food', na=False)]
# select the top 25 categories
top_categories = df_businesses['categories'].value_counts().nlargest(25).index
# filter the dataframe to only include the top 25 categories
df_businesses = df_businesses[df_businesses['categories'].isin(top_categories)]
category_counts = df_businesses['categories'].value_counts()
plt.figure(figsize=(18, 8))
plt.bar(category_counts.index, category_counts.values, color='skyblue')
plt.xticks(rotation=75, ha='right')
plt.xlabel("Category")
plt.ylabel("Number of Restaurants")
plt.title("Number of Restaurants for Each Top Category")
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
To get an overview of the categories we start by looking at the number of restaurants in each category. It can be seen that the most popular categories are 'Nightlife' and 'Bars' followed by 'Sandwiches' and 'Pizza'.
# split the categories into separate columns
df_businesses['categories'] = df_businesses['categories'].str.split(',')
df_businesses = df_businesses.explode('categories')
df_businesses['categories'] = df_businesses['categories'].str.strip()
# remove categories that appears less than 200 times
category_counts = df_businesses['categories'].value_counts()
categories_to_remove = category_counts[(category_counts < 200)].index
# remove cheesesteaks from the categories to remove
categories_to_remove = categories_to_remove[~categories_to_remove.str.contains('Cheesesteaks', na=False)]
df_businesses = df_businesses[~df_businesses['categories'].isin(categories_to_remove)]
# remove 'restaurants' and 'food' from the categories
df_businesses = df_businesses[~df_businesses['categories'].str.contains('Restaurants', na=False)]
df_businesses = df_businesses[~df_businesses['categories'].str.contains('Food', na=False)]
# select the top 25 categories
top_categories = df_businesses['categories'].value_counts().nlargest(25).index
# filter the dataframe to only include the top 25 categories
df_businesses = df_businesses[df_businesses['categories'].isin(top_categories)]
df_group = df_businesses.groupby(['categories', 'stars']).size().reset_index(name='counts')
category_counts = df_businesses.value_counts('categories')
for index, row in df_group.iterrows():
category = row['categories']
df_group.at[index, 'normalized_counts'] = row['counts']/category_counts[category]
df_group = df_group.pivot(index='stars', columns='categories', values='normalized_counts')
source = ColumnDataSource(df_group)
# stars to string
stars = [str(star) for star in np.sort(df_businesses['stars'].unique())]
p = figure(x_axis_label = 'Stars', y_axis_label = 'Normalized Count', width=800, height=700)
p.title.text = 'Distribution of Star Ratings for various restaurant categories in Philadelphia'
bar ={} # to store vbars
categories = df_group.columns.tolist()
HighContrast27 = [
"#1f77b4", # Blue
"#ff7f0e", # Orange
"#2ca02c", # Green
"#d62728", # Red
"#9467bd", # Purple
"#8c564b", # Brown
"#e377c2", # Pink
"#7f7f7f", # Gray
"#bcbd22", # Olive
"#17becf", # Teal
"#393b79", # Dark Blue
"#637939", # Army Green
"#8c6d31", # Dark Tan
"#843c39", # Burgundy
"#7b4173", # Plum
"#3182bd", # Sky Blue
"#f33e52", # Bright Red
"#31a354", # Emerald Green
"#756bb1", # Violet
"#636363", # Charcoal Gray
"#fdae6b", # Light Orange
"#a55194", # Magenta
"#9ecae1", # Light Blue
"#e7ba52" # Mustard
]
legend_items = []
for indx,category in enumerate(categories):
vbar = p.vbar(x='stars', top=category, source=source,
### we will create a vbar for each focuscrime
muted_alpha=0, alpha = 0.7, width = 0.5, color=HighContrast27[indx], muted = False if indx==6 or indx==16 or indx==19 or indx==21 else True)
legend_items.append((category, [vbar]))
legend = Legend(items=legend_items, click_policy="mute")
p.add_layout(legend, 'left')
output_notebook()
show(p)
From the 'Distribution of Star Ratings of Restaurant Reviews in Philadelphia by Category'-plot we can see that the distribution of ratings can vary quite a bit between different categories. For example, the 'Burgers' category has a spread between the restaurants within the category than, for example, the 'Japanese' category.
# Create a boxplot of each average star rating for each category
plt.figure(figsize=(15, 8))
sns.boxplot(x='categories', y='stars', data=df_businesses, palette=HighContrast27, hue='categories', legend=False)
plt.xticks(rotation=75, ha='right')
plt.xlabel('Restaurant Category')
plt.ylabel('Star Rating')
plt.title('Boxplot of Star Ratings for Restaurant Categories in Philadelphia')
for i in range(24):
plt.gca().lines[4+6*i].set_linewidth(2)
plt.gca().lines[4+6*i].set_color('black')
plt.show()
Looking at the boxplot we can see that the majority of the categories have a relatively low spread and a median rating of 3.5 or higher. Interestingly, the categories with the a higher spead are 'Pizza', 'Burgers' and 'Chicken Wings'. This could indicate that there are some restaurants in these categories that are rated very low. For 'Burgers' and 'Chicken Wings' we can observe the lowest median rating of 3.0, maybe implying that the restaurants in these categories are not as good as the others or that the reviewers are more critical of them. Interestingly, 'Pizza' is one of the most represented categories in Philadelphia which could indicate that the quality of pizza varies quite a bit.
Time Analysis¶
In this part of the analysis, we look at restaurant reviews from Philadelphia to find out how time of day affects reviews.
We want to answer a few questions:
When during the day do people leave the most reviews?
Does the time of review change how good or bad the review is?
Do some businesses get different kinds of reviews depending on the time?
To find out, we made different plots to show the trends and patterns in the data.
To properly interpret these results, it's important to keep two things in mind:
- People don’t always write reviews while they’re at the restaurant.
- Reviews can be posted outside of the business’s opening hours, sometimes long after the visit took place.
Because of this, we should be cautious about drawing strong conclusions based purely on the review timestamps. They may not fully reflect the actual time of the dining experience.
# Knowing that the JSON file is structured with each line as a separate JSON object
philly_reviews_df = pd.read_json('../data/philadelphia_restaurant_reviews.json', lines=True)
# Extract the hour from the 'date' column
philly_reviews_df['hour'] = philly_reviews_df['date'].dt.hour
# Group by hour and count the number of reviews for each hour
reviews_per_hour = philly_reviews_df.groupby('hour').size()
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.size'] = 12
plt.figure(figsize=(10,6))
reviews_per_hour.plot(kind='line', marker='o')
plt.title('Number of Reviews per Hour of the Day')
plt.xlabel('Hour of the Day (0-23)')
plt.ylabel('Number of Reviews')
plt.grid(True)
plt.xticks(range(0,24))
plt.show()
To begin, we created a line chart that shows how the total number of reviews changes depending on the hour of the day. From the plot, it's clear that most reviews are written between 4 PM and 2 AM. This makes sense, since restaurants are usually busiest during dinnertime [^4] and late at night, when more people are out eating or grabbing late-night food and more customers often means more reviews.
Line chart of different star rating distribution depending on the time of day¶
We wanted to see when people give different star ratings during the day. To do this, we made one line chart for each rating (from 1 to 5) showing the hours when reviews were written.
For this plot, we used min-max normalization so that each rating is scaled between 0 and 1. The focus is on when people leave reviews, not how many. With normalization, we can better compare the timing patterns for all ratings, even the rare ones like 1-star.
# Group by star and hour
grouped = philly_reviews_df.groupby(['stars', 'hour']).size().unstack(fill_value=0)
grouped = grouped.loc[grouped.index.intersection([1, 2, 3, 4, 5])]
# Min-max normalization
minmax_normalized = (grouped - grouped.min(axis=1).values[:, None]) / \
(grouped.max(axis=1) - grouped.min(axis=1)).values[:, None]
minmax_normalized = minmax_normalized.fillna(0)
# Set font style
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.size'] = 12
# Setup subplots
fig, axes = plt.subplots(3, 2, figsize=(12, 10))
axes = axes.flatten()
# Plot each star rating
for i, (star, row) in enumerate(minmax_normalized.iterrows()):
ax = axes[i]
row.plot(kind='line', marker='o', ax=ax, color='darkorange')
ax.set_title(f"{int(star)}-Star Reviews")
ax.set_xlabel("Hour of the Day")
ax.set_ylabel("Relative Frequency")
ax.set_xticks(range(0, 24))
ax.grid(True)
fig.delaxes(axes[-1])
# Layout and title
plt.tight_layout()
plt.suptitle("Min-Max Normalized Hourly Distribution by Star Rating", fontsize=16, y=1.02)
plt.show()
From the charts, we can see that most reviews are written in the evening and late at night, while mornings have the fewest reviews across all star ratings.
Interactive chart of ratings depending on the time of day¶
Secondly, we wanted to explore whether the star ratings people give change depending on the time of day. To investigate this, we created an interactive Bokeh chart that allows users to toggle between different 3-hour time blocks and see how the distribution of star ratings changes throughout the day.
output_notebook()
HighContrast10 = [
"#1f77b4", # Blue
"#ff7f0e", # Orange
"#2ca02c", # Green
"#d62728", # Red
"#9467bd", # Purple
"#8c564b", # Brown
"#e377c2", # Pink
"#7f7f7f", # Gray
"#bcbd22", # Olive
"#17becf" # Teal
]
# Prepare data
philly_reviews_df['date'] = pd.to_datetime(philly_reviews_df['date'])
philly_reviews_df['hour'] = philly_reviews_df['date'].dt.hour
# Create 3-hour time blocks
def map_three_hour_block(hour):
start = (hour // 3) * 3
end = start + 2
return f"{start:02d}-{end:02d}"
philly_reviews_df['three_hour_block'] = philly_reviews_df['hour'].apply(map_three_hour_block)
# Count reviews per (star, time block)
grouped = philly_reviews_df.groupby(['stars', 'three_hour_block']).size().unstack(fill_value=0)
# Make sure columns are sorted by time
time_blocks = sorted(grouped.columns)
grouped = grouped[time_blocks]
for star in range(1, 6):
if star not in grouped.index:
grouped.loc[star] = [0] * len(time_blocks)
grouped = grouped.sort_index()
grouped.index = grouped.index.map(str)
stars = grouped.index.tolist()
grouped.reset_index(inplace=True)
grouped.rename(columns={'stars': 'stars'}, inplace=True)
source = ColumnDataSource(grouped)
p = figure(x_range=FactorRange(*stars),
height=500, width=900,
title="Review Counts per Star Rating across 3-Hour Blocks",
toolbar_location="above", tools="pan,wheel_zoom,reset,save")
colors = HighContrast10
for i, block in enumerate(time_blocks):
p.vbar(x='stars',
top=block,
source=source,
width=0.2,
color=colors[i % len(colors)],
legend_label=block,
muted_alpha=0.1,
muted=True,
alpha=0.7)
p.xaxis.axis_label = "Star Rating"
p.yaxis.axis_label = "Number of Reviews"
p.xaxis.major_label_orientation = 1.0
p.x_range.range_padding = 0
p.y_range.start = 0
p.title.text_font_size = "14pt"
p.legend.location = "top_left"
p.legend.click_policy = "mute"
p.add_layout(p.legend[0], 'left')
show(p)
From this chart, users can easily toggle between different time slots to see how star ratings are distributed throughout the day. However, the overall pattern looks very similar across all time blocks:
- 1-star reviews are more common than 2-star reviews
- 5 stars is the most frequently given rating in all time slots
- 4-star reviews are more common than 3 stars, but less common than 5 stars
- 3-star reviews appear more often than both 1- and 2-star reviews
Time analysis of three specific businesses¶
For the next step, we wanted to dig deeper into individual businesses to see how star ratings vary throughout the day. To find businesses worth studying, we first filtered for businesses with at least 50 reviews to ensure the data was reliable. Then, we calculated the standard deviation of star ratings for each business and selected those with the highest variation. Businesses with a wide range of different reviews are especially interesting to investigate, as they might show stronger time-based trends. As an example, this can be used for Yelp users to determine at what time they should go to the restaurant.
business_df = pd.read_json('../data/yelp_academic_dataset_business.json', lines=True)
# Count number of reviews per business
review_counts = philly_reviews_df.groupby('business_id').size()
# Keep only businesses with at least 50 reviews
businesses_with_enough_reviews = review_counts[review_counts >= 50].index.tolist()
# Group reviews by business and hour, and calculate mean stars
business_hour_avg = philly_reviews_df.groupby(['business_id', 'hour'])['stars'].mean().reset_index()
# Calculate std for each business
business_variability = business_hour_avg.groupby('business_id')['stars'].std()
# Keep only businesses with enough reviews
business_variability = business_variability[business_variability.index.isin(businesses_with_enough_reviews)]
# Sort and pick top 3 businesses
top_variable_business_ids = business_variability.sort_values(ascending=False).head(3).index.tolist()
print(top_variable_business_ids)
print("Top 3 businesses with highest variation and their review counts:")
for business_id in top_variable_business_ids:
count = review_counts[business_id]
business_info = business_df[business_df['business_id'] == business_id]
business_name = business_info['name'].values[0]
print(f"- {business_name}: {count} reviews")
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.size'] = 12
plt.figure(figsize=(10, 15))
for i, business_id in enumerate(top_variable_business_ids, 1):
business_info = business_df[business_df['business_id'] == business_id]
business_name = business_info['name'].values[0]
reviews = philly_reviews_df[philly_reviews_df['business_id'] == business_id].copy()
reviews['date'] = pd.to_datetime(reviews['date'])
reviews['hour'] = reviews['date'].dt.hour
stars_per_hour = reviews.groupby('hour')['stars'].mean()
plt.subplot(3, 1, i)
stars_per_hour.plot(kind='line', marker='o')
plt.title(f'{business_name}')
plt.ylim(0, 5.5)
plt.grid(True)
plt.xticks(range(0, 24))
plt.ylabel('Average Stars')
plt.tight_layout()
plt.show()
['97SR7RQPL02t5J41UcZ4SQ', 'jPG_BuIKf0KBCFY6u00h-A', 'fT0lXvUz41XaZbgtSMqTKg'] Top 3 businesses with highest variation and their review counts: - Pat's Pizzeria: 51 reviews - Love Park Pizza and Chicken: 55 reviews - Mangiamo 444: 55 reviews
Pat's Pizzeria¶
Opening hours: 11:00 - 23:30
Pat’s Pizzeria shows a lot of variation during the day.
- Some early morning and midday reviews are as low as 1 star, while others reach 5 stars.
- The pattern is inconsistent, which could point to changing customer experiences depending on time. This could be caused by differences in staff or the state of the customers themselves (e.g. diner guests vs. lunch guests).
Based on the plot, a user might prefer to visit Pat’s Pizzeria either at lunchtime or late in the evening, as those are the times when reviews tend to be more positive.
Love Park Pizza and Chicken¶
Opening hours: 10:00 - 3:30
This business also shows significant variation in review scores throughout the day:
- Some time slots (such as 2 AM, 3 PM, and 8 PM) reach 5-star ratings, while others drop to just 1 or 2 stars.
- The ratings change sharply from hour to hour, which may reflect inconsistent service quality or differences in the type of customers at various times (e.g., post-party visitors vs. lunch diners).
Based on the trend, the best time to visit seems to be at the end of peak periods, like after lunch, dinner, or late-night hours. In contrast, this business appears to be a poor choice for breakfast, as early reviews tend to be lower.
Mangiamo 444¶
Opening hours: 16:00 - 21:30
This business has a very sharp pattern:
- During opening hours, the restaurants receive 3 stars or more.
- But during night and afternoon hours, ratings drop as low as 1 star.
Interestingly, the lowest ratings are typically posted outside of business hours. This could suggest that customers had time to reflect on their experience before writing a more critical review. However, there are also a few 5-star reviews posted in the morning, indicating that not all delayed reviews are negative.
Conclusion for the time analysis of three businesses¶
All three businesses show that review ratings can vary significantly throughout the day, and some may have consistent weak spots (e.g. breakfast time for the first two businesses). This kind of hour-by-hour analysis could be useful for businesses to identify problem times and improve consistency.
Grid Analysis¶
Next, we wish to explore how different areas of Philadelphia score in terms of average reviews. Specifically, we want to create an interactive map that, at a glance, showcases how areas compare to each other int terms of ratings to allow the user to pick out their next destination.
Furthermore, the map should allow filitering of categories such that user can find the best rated areas for the exact type of place they are looking for.
The way we do this, is by dividing the city into grids and the businesses into category clusters. Then for each grid get the average rating in each grid and use a color map to visualize the ratings in an intuitive way.
# Step 1: Divide the city in grid cells
# Grid size constant
GRID_SIZE = 0.01 # 1 km approximately
# Grid creation
min_lon, min_lat = df_business['longitude'].min(), df_business['latitude'].min()
df_business['grid_x'] = ((df_business['longitude'] - min_lon) // GRID_SIZE).astype(int)
df_business['grid_y'] = ((df_business['latitude'] - min_lat) // GRID_SIZE).astype(int)
df_business['grid'] = df_business['grid_x'].astype(str) + '_' + df_business['grid_y'].astype(str) # Grid stored in the format 'x_y' (e.g. 0_0, 1_0, etc.)
# Step 2: Find top restaurant categories
df_business = df_business[df_business['categories'].notnull()] # Handle null values in categories
df_business = df_business[df_business['categories'].str.contains('Restaurants', na=False)] # filter so categories contain restaurants
df_business['category_list'] = df_business['categories'].str.split(',').apply(lambda cats: [c.strip() for c in cats])
exploded_categories = df_business.explode('category_list')
top_categories = exploded_categories['category_list'].value_counts().head(15).index.tolist()
top_categories = [cat for cat in top_categories if cat not in ['Food', 'American (New)', 'American (Traditional)']] # Here we exclude 'food' as too generic and 'American' as the data does not show correctly
# Step 3: Group by grid
df_counts = df_business.groupby('grid').agg(
count=('stars', 'size'),
avg_rating=('stars', 'mean'),
categories=('categories', lambda x: ', '.join(set(x))),
).reset_index()
# Step 4: Create map
map_center = [df_business['latitude'].mean(), df_business['longitude'].mean()]
grid_map = folium.Map(
location=map_center,
zoom_start=12,
min_zoom=11,
max_bounds=True, # Set bounds as no data exists outside the city
min_lat=min_lat - 0.1,
max_lat=df_business['latitude'].max() + 0.1,
min_lon=min_lon - 0.1,
max_lon=df_business['longitude'].max() + 0.1,
)
# Step 5: Create color map
cmap = plt.get_cmap('RdYlGn') # Red to green color map
# Step 6: Iterate categories to create folium feature group for each. I.e. a layer on the map which can be toggled on/off
show = True # Flag to only show the first layer initially
for category in top_categories:
fg = folium.FeatureGroup(name=category, show=show)
show = False
# Step 7: Filter grids and restaurants for the category
category_grids = df_counts[df_counts['categories'].str.contains(category, na=False)]
restaurants_in_cat = exploded_categories[exploded_categories['category_list'] == category]
# Step 8: Iterate over the grids, create rectangles and add to the feature group
for _, row in category_grids.iterrows():
grid_x, grid_y = map(int, row['grid'].split('_')) # Split the grid string into x and y coordinates
lat = min_lat + (grid_y + 0.5) * GRID_SIZE
lon = min_lon + (grid_x + 0.5) * GRID_SIZE
cell_restaurants = restaurants_in_cat[restaurants_in_cat['grid'] == row['grid']]
# Calculate the average rating for the grid cell and normalize it to [0, 1] for color mapping
color = matplotlib.colors.rgb2hex(cmap((cell_restaurants['stars'].mean() - 1) / 4)) # Normalize the average rating to [0, 1] for color mapping
if len(cell_restaurants) > 2: # Only plot rectangle if there are more than 2 restaurants in the grid cell
folium.Rectangle(
bounds=[[lat - GRID_SIZE/2, lon - GRID_SIZE/2], [lat + GRID_SIZE/2, lon + GRID_SIZE/2]],
color=None,
fill=True,
fill_color=color,
fill_opacity=0.5,
).add_to(fg)
# Step 9: Plot restaurant markers in this grid cell
for _, r in cell_restaurants.iterrows():
color = matplotlib.colors.rgb2hex(cmap((r['stars'] - 1) / 4)) # Normalize the restaurant rating to [0, 1] for color mapping
folium.Circle(
location=[r['latitude'], r['longitude']],
radius=GRID_SIZE * 1000, # Adjust size of the circle
color='Black',
weight=1,
opacity=0.5,
fill=True,
fill_color=color,
fill_opacity=0.8,
popup=f"{r['name']}<br>Rating: {r['stars']}<br>Category: {r['category_list']}", # Popup with restaurant name, rating and category
).add_to(fg)
# Step 10: Add the feature group to the map
fg.add_to(grid_map)
# Step 11: Add layer control to toggle feature groups on/off
folium.LayerControl(collapsed=False).add_to(grid_map)
# Step 12: Add legend to the map with rough color scale
# Here we create a custom HTML legend to show the color scale
legend_html = """
{% macro html(this, kwargs) %}
<div style="
position: fixed;
bottom: 50px;
left: 50px;
width: 150px;
height: 170px;
background-color: white;
border:2px solid grey;
z-index:9999;
font-size:14px;
padding: 10px;
">
<b>Avg Rating of Area</b><br>
<i style="background: #a50026; width: 18px; height: 18px; float: left; margin-right: 8px;"></i> 1.0 - 2.0<br>
<i style="background: #f46d43; width: 18px; height: 18px; float: left; margin-right: 8px;"></i> 2.0 - 3.0<br>
<i style="background: #fee08b; width: 18px; height: 18px; float: left; margin-right: 8px;"></i> 3.0 - 4.0<br>
<i style="background: #a6d96a; width: 18px; height: 18px; float: left; margin-right: 8px;"></i> 4.0 - 4.5<br>
<i style="background: #1a9850; width: 18px; height: 18px; float: left; margin-right: 8px;"></i> 4.5 - 5.0<br>
</div>
{% endmacro %}
"""
legend = MacroElement()
legend._template = Template(legend_html)
grid_map.get_root().add_child(legend)
# Step 13: add title to the map
title_html = """
{% macro html(this, kwargs) %}
<div style="
position: fixed;
top: 10px;
left: 50%;
transform: translateX(-50%);
width: auto;
height: auto;
background-color: white;
border:2px solid grey;
z-index:9999;
font-size:16px;
padding: 10px;
">
<b>Philadelphia Restaurant Ratings</b><br>
</div>
{% endmacro %}
"""
title = MacroElement()
title._template = Template(title_html)
grid_map.get_root().add_child(title)
# Step 14: Show the map
grid_map